GDAS002-Bioconductor备忘录

前言

这篇笔记的主要内容是关于Bioconductor的一些备忘录(cheat sheet),作者总结的不错,我就按原文翻译了下来,由于一些知识点并没有学到,因此未翻译,后面学到后,再回头过来翻译。

安装

安装细节参考http://bioconductor.org/install/,具体代码如下所示:

1
2
3
4
5
6
7
8
9
10
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install()
BiocManager::install(c("package1","package2")
BiocManager::valid() # are packages up to date?
# what Bioc version is release right now?
http://bioconductor.org/bioc-version
# what Bioc versions are release/devel?
http://bioconductor.org/js/versions.js

R语言帮助

简单的帮忙如下所示:

1
2
3
4
5
?函数名
?"eSet-class" # 如果要查看类的帮助,需要后缀'-class'
help(package="foo",help_type="html") # 使用web浏览器来查看帮助文档
vignette("topic")
browseVignettes(package="package") # 某包的简单案例

高级用户的帮助功能

1
2
3
4
5
6
7
8
9
10
11
functionName # 直接输入函数名(不带括号),则显示函数代码
getMethod(method,"class") # 显示某类的方法代码
selectMethod(method, "class") # 查看继承来的方法
showMethods(classes="class") # 显示某类的所有方法show all methods for class
methods(class="GRanges") # 此函数限于R >=3.2
?"functionName,class-method" # 用于查看S4对象的方法文档
?"plotMA,data.frame-method" # 需要加载geneplotter包,from library(geneplotter)
?"method.class" # 查看S3对象的方法
?"plot.lm"
sessionInfo() # 获取R语言版本的相关帮助信息
packageVersion("foo") # 查看某个包的版本

Bioconductor相关的论坛: https://support.bioconductor.org

如果你使用的Rstudio,则可以使用?help来查看帮忙文档,如果使用是命令行模式(例如Linux),则可以使用下面的代码通过浏览器来查看帮助文档(我使用的Rstudio,因此未尝试以下代码):

1
alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args"

debugging R

1
2
3
4
5
6
7
8
9
10
11
12
13
14
traceback() # 查看哪一步导致了R错误
# debug a function
debug(myFunction) # 逐行遍历函数中的代码
undebug(myFunction) # 停止debugging
debugonce(myFunction) # 功能如上,但是不需要undebug()
# 在写代码的过程中,可以将函数browser()放到一个关系的节点上
# this plus devtools::load_all() can be useful for programming
# to jump in function on error:
options(error=recover)
# turn that behavior off:
options(error=NULL)
# debug, e.g. estimateSizeFactors from DESeq2...
# debugging an S4 method is more difficult; this gives you a peek inside:
trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet")

显示某个类的包特异方法

以下两个段代码实现的功能大致相同:查看特定的某个包中,某个特定类对象的操作方法。给定类的对象的操作方法。

1
2
3
intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2"))
sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet",
where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE))

注释Annotations

这里需要先更新一下AnnotationHub,现在我们查看一下AnnotationHub的某个案例:

https://master.bioconductor.org/packages/release/bioc/html/AnnotationHub.html

以下代码是如何使用数据库包,以及biomart,先看AnnotationDbi

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 某用包个注释包
library(AnnotationDbi)
library(org.Hs.eg.db) # 这个是人类注释包,即Homo.sapiens
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns与keytypes返回的结果一样,即有什么样的注释内容
k <- head(keys(org.Hs.eg.db, keytype="ENTREZID"))
# 返回一个字符向量(1对1的关系),查看mapIds则可以查看多个选项
res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID")
# 产生1对多的映射关系
res <- select(org.Hs.eg.db, keys=k,
columns=c("ENTREZID","ENSEMBL","SYMBOL"),
keytype="ENTREZID")

biomaRt

这个包主要用于各种基因ID的转换。

1
2
3
4
5
6
7
# 使用biomart包,将一个注释与另外一个注释进行映射
library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
map <- getBM(mart = m,
attributes = c("ensembl_gene_id", "entrezgene"),
filters = "ensembl_gene_id",
values = some.ensembl.genes)

GenomicRanges包

GenomicRanges这个包用于分析高通量测序数据,它能有效地表壳与操作基因组注释信息(genomic annotations)与比对结果(alignments)。GenomicRanges包定义了一个通用容器,此容器用于储存与操作基因间隔(genomic intervals)与变量(variables)。在GenomicAlignments包和SummarizedExperiment包中定义了一个更加特殊的容器,GenomicAlignments包中的这个更加特殊的容器可以根据一个参考基因组来表示和操作一些短比对结果(short alignments),SummarizedExperiment包中容器可以构建一个实验的矩阵样汇总信息。

此包的使用如下所示:

1
2
3
4
5
6
7
8
9
10
11
library(GenomicRanges)
z <- GRanges("chr1",IRanges(1000001,1001000),strand="+")
start(z)
end(z)
width(z)
strand(z)
mcols(z) # the 'metadata columns', any information stored alongside each range
ranges(z) # gives the IRanges
seqnames(z) # the chromosomes for each ranges
seqlevels(z) # the possible chromosomes
seqlengths(z) # the lengths for each chromosome

Intra-range methods

不清楚功能,后面补充。

Affects ranges independently

function description
shift moves left/right
narrow narrows by relative position within range
resize resizes to width, fixing start for +, end for -
flank returns flanking ranges to the left +, or right -
promoters similar to flank
restrict restricts ranges to a start and end position
trim trims out of bound ranges
+/- expands/contracts by adding/subtracting fixed amount
* zooms in (positive) or out (negative) by multiples

Inter-range methods

Affects ranges as a group

function description
range one range, leftmost start to rightmost end
reduce cover all positions with only one range
gaps uncovered positions within range
disjoin breaks into discrete ranges based on original starts/ends

Nearest methods

Given two sets of ranges, x and subject, for each range in x, returns…

function description
nearest index of the nearest neighbor range in subject
precede index of the range in subject that is directly preceded by the range in x
follow index of the range in subject that is directly followed by the range in x
distanceToNearest distances to its nearest neighbor in subject (Hits object)
distance distances to nearest neighbor (integer vector)

A Hits object can be accessed with queryHits, subjectHits and mcols if a distance is associated.

set methods

If y is a GRangesList, then use punion, etc. All functions have default ignore.strand=FALSE, so are strand specific.

1
2
3
union(x,y)
intersect(x,y)
setdiff(x,y)

Overlaps

1
2
3
4
x %over% y # logical vector of which x overlaps any in y
fo <- findOverlaps(x,y) # returns a Hits object
queryHits(fo) # which in x
subjectHits(fo) # which in y

Seqnames and seqlevels

GenomicRanges and GenomeInfoDb

1
2
gr.sub <- gr[seqlevels(gr) == "chr1"]
seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1'

Sequences

Biostrings是一个操作序列或序列集的包,简要操作可以参考Biostrings Quick Overview PDF,关于一些物种的,缩写的信息可以参考cheat sheet for annotation

1
2
3
library(BSgenome.Hsapiens.UCSC.hg19)
dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet
# also Views() for Bioconductor >= 3.1
1
2
library(Biostrings)
dnastringset <- readDNAStringSet("transcripts.fa")
1
2
3
4
5
6
7
8
9
10
11
12
substr(dnastringset, 1, 10) # to character string
subseq(dnastringset, 1, 10) # returns DNAStringSet
Views(dnastringset, 1, 10) # lightweight views into object
complement(dnastringset)
reverseComplement(dnastringset)
matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome
vmatchPattern("ACGTT", dnastringset) # also vcountPattern
letterFrequecy(dnastringset, "CG") # how many C's or G's
# also letterFrequencyInSlidingView
alphabetFrequency(dnastringset, as.prob=TRUE)
# also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency
# transcribe/translate for imitating biological processes

测序数据

Rsamtools中的 scanBam 返回BAM文件中的原始数值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
library(Rsamtools)
which <- GRanges("chr1",IRanges(1000001,1001000))
what <- c("rname","strand","pos","qwidth","seq")
param <- ScanBamParam(which=which, what=what)
# for more BamFile functions/details see ?BamFile
# yieldSize for chunk-wise access
bamfile <- BamFile("/path/to/file.bam")
reads <- scanBam(bamfile, param=param)
res <- countBam(bamfile, param=param)
# for more sophisticated counting modes
# see summarizeOverlaps() below
# quickly check chromosome names
seqinfo(BamFile("/path/to/file.bam"))
# DNAStringSet is defined in the Biostrings package
# see the Biostrings Quick Overview PDF
dnastringset <- scanFa(fastaFile, param=granges)

GenomicAlignments 返回一个Bioconductor对象(GRanges-based)

1
2
3
library(GenomicAlignments)
ga <- readGAlignments(bamfile) # single-end
ga <- readGAlignmentPairs(bamfile) # paired-end

转录本数据库

GenomicFeatures

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# get a transcript database, which stores exon, trancript, and gene information
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site)
txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf")
# or build a txdb from Biomart (however, not as easy to reproduce later)
txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# in Bioconductor >= 3.1, also makeTxDbFromGRanges
# saving and loading
saveDb(txdb, file="txdb.sqlite")
loadDb("txdb.sqlite")
# extracting information from txdb
g <- genes(txdb) # GRanges, just start to end, no exon/intron information
tx <- transcripts(txdb) # GRanges, similar to genes()
e <- exons(txdb) # GRanges for each exon
ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene
ebt <- exonsBy(txdb, by="tx") # similar but by transcript
# then get the transcript sequence
txSeq <- extractTranscriptSeqs(Hsapiens, ebt)

根据范围(ranges)和实验汇总信息

SummarizedExperiment是高维数据的储存类, 它与GRangesGRangesList配合使用(使用每个基因外显子的read计数)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
library(GenomicAlignments)
fls <- list.files(pattern="*.bam$")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ebg <- exonsBy(txdb, by="gene")
# see yieldSize argument for restricting memory
bf <- BamFileList(fls)
library(BiocParallel)
register(MulticoreParam(4))
# lots of options in the man page
# singleEnd, ignore.strand, inter.features, fragments, etc.
se <- summarizeOverlaps(ebg, bf)
# operations on SummarizedExperiment
assay(se) # the counts from summarizeOverlaps
colData(se)
rowRanges(se)

Salmon是一个定量工具,如果数据占有GC依赖的数据,那么就添加上--gcBias参数,接着使用 tximport,以下是使用案例:

1
2
3
4
5
coldata <- read.table("samples.txt")
rownames(coldata) <- coldata$id
files <- coldata$files; names(files) <- coldata$id
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)

Rsubread包中的featureCounts函数也可以用于read计算,速度可能更快。

1
2
3
4
5
6
library(Rsubread)
res <- featureCounts(files, annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon",
GTF.attrType="gene_id")
res$counts

RNA-seq分析方法

DESeq2:My preferred pipeline for DESeq2 users is to start with a lightweight transcript
abundance quantifier such as Salmon
and to use tximport, followed
by DESeqDataSetFromTximport.

Here, coldata is a data.frame with group as a column.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(DESeq2)
# from tximport
dds <- DESeqDataSetFromTximport(txi, coldata, ~ group)
# from SummarizedExperiment
dds <- DESeqDataSet(se, ~ group)
# from count matrix
dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group)
# minimal filtering helps keep things fast
# one can set 'n' to e.g. min(5, smallest group sample size)
keep <- rowSums(counts(dds) >= 10) >= n
dds <- dds[keep,]
dds <- DESeq(dds)
res <- results(dds) # no shrinkage of LFC, or:
res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs

edgeR

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# this chunk from the Quick start in the edgeR User Guide
library(edgeR)
y <- DGEList(counts=counts,group=group)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit)
topTags(lrt)
# or use the QL methods:
qlfit <- glmQLFit(y,design)
qlft <- glmQLFTest(qlfit)
topTags(qlft)

limma-voom

1
2
3
4
5
6
7
8
9
10
library(limma)
design <- model.matrix(~ group)
y <- DGEList(counts)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
v <- voom(y,design)
fit <- lmFit(v,design)
fit <- eBayes(fit)
topTable(fit)

更多有关RNA-seq操作的包可以参考:Many more RNA-seq packages

表达数据集Expression set

1
2
3
4
5
6
library(Biobase)
data(sample.ExpressionSet)
e <- sample.ExpressionSet
exprs(e)
pData(e)
fData(e)

下载GEO数据库

1
2
library(GEOquery)
e <- getGEO("GSE9514")

芯片分析

1
2
3
4
5
6
7
8
library(affy)
library(limma)
phenoData <- read.AnnotatedDataFrame("sample-description.csv")
eset <- justRMA("/celfile-directory", phenoData=phenoData)
design <- model.matrix(~ Disease, pData(eset))
fit <- lmFit(eset, design)
efit <- eBayes(fit)
topTable(efit, coef=2)

iCOBRA包

iCOBRA包用于计算和可视化排序数据以及二分类匹配后的数据。例如,iCOBRA包的一个典型用途就是用于评估基因表达实验中不同组之间的比较方法,我们可以视之为一个排序问题(评估正确的效应量(effect size)以及按照显著性进行排列的基因),还是一个二分类比较问题(binary assignment problem)(例如将基因分为差异表达和非差异表达)。

1
2
3
4
5
6
7
library(iCOBRA)
cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df)
cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC")
cobraplot <- prepare_data_for_plot(cp)
plot_fdrtprcurve(cobraplot)
# interactive shiny app:
COBRAapp(cd)

参考资料

  1. Genomic annotation in Bioconductor: Cheat sheet

  2. HarvardX Biomedical Data Science Open Online Training